Comparative genomics study of inverted repeats in bacteria 

We investigate the number of inverted repeats observed in 37 complete genomes of bacteria. The
number of inverted repeats observed is much higher than expected using Markovian models of DNA 
sequences in most of the eubacteria. By using the information annotated in the genomes we discover 
that in most of the eubacteria the inverted repeats of stem length longer than 8 nucleotides prefer- 
entially locate near the 3' end of the nearest coding regions. We also show that IRs characterized by 
large values of the stem length locate preferentially in short non-coding regions bounded by two 3' 
ends of convergent genes. By using the program TransTerm recently introduced to predict transcrip- 
tion terminators in bacterial genomes, we conclude that only a part of the observed inverted repeats 
fulfills the model requirements characterizing rho-independent termination in several genomes.

An inverted repeat (IR) in a DNA sequence provides the
necessary condition for the potential existence of a hair- 
pin structure in the transcribed messenger RNA and/or 
cruciform structures in DNA [1]. Inverted repeats play an
important role in the regulation of transcription and translation.
Examples are the role of the IR located in the promoter
region of the S10 ribosomal protein operon
and the role of the IR in the lac operator. It has also
been proposed that hairpins play an important role in the 
control of transcription, translation and other biological 
functions. Hairpin- or cruciform-binding proteins
have been identified from several species. These results
suggest that regulatory hairpins may be involved in transcription
of a number of genes. The existence in vivo
of hairpin structures of messenger RNA during transcription
has been demonstrated [10]-[14]. Hairpin structures
are often associated with rho-independent intrinsic ter- 
minators of genes in several bacteria. The presence of 
such intrinsic terminators has been observed in E. coli 
[15]. Rho-independent intrinsic terminators have also
been detected in other bacteria, for example Streptococcus
pneumoniae [16], Pseudomonas aeruginosa [17],
Myxococcus xanthus [18], Streptococcus equisimilis H46A
[19] and Chromatium vinosum D [20]. Hairpin structures
can occur in the mRNA when an IR is present in the
DNA sequence. For example, the sequence
5'-aGGAATCGATCTTaacgAAGATCGATTCCa-3' contains
the sub-sequence GGAATCGATCTT, which is the IR
of AAGATCGATTCC. This IR can form a hairpin having
a stem of length 12 nucleotides and a loop (aacg) of
length 4 nucleotides in the transcribed RNA. The number
of IRs has been investigated with bioinformatics
methods in long DNA sequences of eukaryotic (human
and yeast) and bacterial (E. coli) DNA and in the
complete genomes of the eubacterium Haemophilus influenzae,
the archaebacterium Methanococcus jannaschii and the
cyanobacterium Synechocystis sp. PCC6803. These
studies have shown that inverted repeats are rather abundant
in E. coli and Haemophilus influenzae, scarce
in Methanococcus jannaschii and show no enrichment
(with respect to a Bernoullian assumption about
DNA sequences) in Synechocystis.
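
The hairpin example given above can be checked with a short Python sketch. It is only an illustration and not the specialized program used in this study: the function names, the default stem and loop bounds, and the rule that discards a hit whose two outermost loop nucleotides could still pair are all assumptions made for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def pairs(a, b):
    """True when nucleotides a and b are Watson-Crick complementary."""
    return COMPLEMENT.get(a) == b


def find_inverted_repeats(seq, min_stem=4, max_stem=20, min_loop=3, max_loop=10):
    """Scan seq for inverted repeats; return (stem_start, stem_length, loop_length).

    Assumption: a hit is kept only if its stem cannot be extended, either
    outwards (handled by the while loop) or inwards into the loop.
    Positions are 0-based.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for loop_start in range(1, n):
        for loop in range(min_loop, max_loop + 1):
            right = loop_start + loop  # first nucleotide of the right stem arm
            if right >= n:
                break
            # Inner check: the first and last loop nucleotides must not pair.
            if pairs(seq[loop_start], seq[right - 1]):
                continue
            stem = 0
            while (loop_start - 1 - stem >= 0 and right + stem < n
                   and pairs(seq[loop_start - 1 - stem], seq[right + stem])):
                stem += 1
            if min_stem <= stem <= max_stem:
                hits.append((loop_start - stem, stem, loop))
    return hits


example = "aGGAATCGATCTTaacgAAGATCGATTCCa"
print(reverse_complement("GGAATCGATCTT"))          # AAGATCGATTCC
print(find_inverted_repeats(example, min_stem=12))  # [(1, 12, 4)]
```

For the example sequence the sketch reports a single hit with a stem of 12 nucleotides and a loop of 4 nucleotides, in agreement with the description in the text.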



DATA AND METHODS 

In the present study, we investigate 37 complete 
genomes of bacteria recently sequenced (all the bacte- 
rial complete genomes publicly available at the time of 
our study). The set consists of 8 archaebacteria, 1 aquificales,
1 thermotogales, 2 spirochetales, 5 chlamydiales, 1
deinococcus, 1 cyanobacterium, 12 proteobacteria (purple
bacteria) and 6 firmicutes (gram positive). The analyzed
DNA totals 77.8 million base pairs. In the completed
genomes we search, with a specialized computer
program, for all the inverted repeats of stem length ℓ ranging
from 4 to 20 and loop (spacer) length m ranging from 3 to
10. These are typical boundaries in the range of the ones
used in the literature for the investigation of IRs in DNA.
The results of our investigation are
illustrated in Figure 1 and summarized in Table I. The
available genomes differ from one another with respect
to the CG content and the degree of its fluctuation
along the genome. In our study, each detected IR is the
IR of maximal stem length ℓ located at a given DNA
position. In other words, for each observed IR we check that
the first two nucleotides immediately outside the stem region
do not have a palindromic counterpart. Within
this definition, the number of IRs expected in a genome 
under the simplest assumption of a random Bernoullian 
DNA is given by the equation 



$$n_{ex}(\ell, m) = N \left(1 - 2P_aP_t - 2P_cP_g\right)^{2} \left(2P_aP_t + 2P_cP_g\right)^{\ell} \qquad (1)$$



where N is the number of nucleotides in the genome se- 
quence and Pa, Pc, Pg and Pt are the observed frequen- 
cies of nucleotides. Eq. (1) shows that the number of 
expected IRs is independent of m whereas it depends on 
the CG content of the genome. The CG content can vary 
considerably across different regions of the same genome.
Moreover, a Bernoullian description of DNA sequences
provides just a rough approximation of the statistical
properties observed in real genomes. For this reason, we
compare the results obtained in real genomes
with the ones obtained from a computer generated
first-order Markov genome having the same probability
matrix of dinucleotides empirically observed within
each non-overlapping window of 10,000 nucleotides of
each genome. The occurrences of IRs detected in the computer
generated genomes are used for comparison in Figure 1
and to obtain the values used to illustrate the
empirical results summarized in Table I.
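
A surrogate genome of this kind could be generated along the following lines. This is a minimal sketch under stated assumptions, not the code used in this study: the function name `markov_surrogate`, the use of conditional transition weights estimated from the dinucleotide counts of each window, the fallback to the window base composition, and the restriction to the four letters A, C, G, T are all choices made here for illustration.

```python
import random
from collections import Counter

BASES = "ACGT"


def markov_surrogate(genome, window=10_000, seed=0):
    """Generate a first-order Markov sequence, window by window.

    Assumptions: the genome contains only A, C, G, T, and the chain is
    continued across window boundaries using the last generated base.
    """
    rng = random.Random(seed)
    genome = genome.upper()
    surrogate = []
    prev = None
    for start in range(0, len(genome), window):
        chunk = genome[start:start + window]
        base_counts = Counter(chunk)
        pair_counts = Counter(zip(chunk, chunk[1:]))  # dinucleotide counts
        for _ in range(len(chunk)):
            # Weights proportional to the observed dinucleotides prev -> b.
            weights = [pair_counts[(prev, b)] for b in BASES]
            if prev is None or sum(weights) == 0:
                # No usable transition: fall back to the base composition.
                weights = [base_counts[b] for b in BASES]
            prev = rng.choices(BASES, weights=weights)[0]
            surrogate.append(prev)
    return "".join(surrogate)


toy = "ACGT" * 50
print(len(markov_surrogate(toy, window=40, seed=1)))  # 200
```

Because the transition weights are re-estimated in every window, the surrogate follows local changes of composition while destroying any correlation beyond nearest neighbours.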

FIG. 1. Contour plot of the decimal logarithm of the total
occurrence of IRs of stem length ℓ and loop length m for 5 representative
genomes (bottom panels) compared with the decimal
logarithm of the occurrence in the corresponding computer
generated first-order Markovian data (top panels). From left to
right we show (a) the archaebacterium Archaeoglobus fulgidus,
(b) the chlamydia Chlamydia pneumoniae CWL029, (c) the proteobacterium
Escherichia coli, (d) the firmicute Bacillus subtilis and
(e) the cyanobacterium Synechocystis sp. The numbers on the
color scale are the decimal logarithm of the total occurrence of
IRs.

In Figure 1 we show the contour lines of the decimal
logarithm of the number of occurrences of IRs as a function
of the stem length ℓ and of the loop length m for
5 representative genomes. To take into account the differences
in length, CG content and Markovian characterization
observed in the genomes we also show in Figure 1
(top panels of the figure) the results obtained for
the same investigation performed with the corresponding
computer generated Markovian genomes. The selected
genomes are: Archaeoglobus fulgidus (archaebacteria),
Chlamydia pneumoniae CWL029 (chlamydiales), Bacillus
subtilis (firmicutes), Escherichia coli (proteobacteria)
and Synechocystis sp. (cyanobacteria). Apart from
statistical fluctuations at the noise level,
Markovian generated data show contour lines that are
straight lines parallel to the horizontal axis (loop length
m). This is consistent with the theoretical result of Eq.
(1) obtained for a Bernoullian DNA sequence. In fact,
the theoretical prediction of Eq. (1) does not depend
on m. Moreover, the distance between two successive
contour lines is approximately constant. This second
observation indicates that, once again consistently with
the prediction of Eq. (1), the number of occurrences of
inverted repeats decreases exponentially with the stem
length ℓ in a local Markov model of DNA sequences.
The results obtained in real genomes (bottom panels of
Figure 1) are genome dependent. Some genomes, such as Archaeoglobus
fulgidus and Synechocystis sp., show a general
pattern hardly distinguishable from the one observed in
computer generated data, whereas Chlamydia pneumoniae
CWL029, Bacillus subtilis and Escherichia coli are
characterized by an occurrence of inverted repeats which
is not explained by a first-order Markovian model of
DNA. For these three genomes, which are representative
of several other genomes, it is worth noting that
the contour lines of Figure 1 show both (i) a value of the
occurrence of inverted repeats much larger than expected
for a first-order Markovian model for large values of ℓ and
(ii) an occurrence of inverted repeats which is strongly
dependent on the specific value of m for large values of
ℓ. We verify that higher-order Markovian models up to
the fifth order also fail to reproduce these empirical results.
In other words, these empirical results cannot be
ascribed to the strong bias up to the hexamer level present
in several genomes. In agreement with previous studies
devoted to the search for inverted repeats in long DNA sequences
(Schroth and Shing Ho, 1995; Cox and Mirkin,
1997) we detect the existence of different levels of enhancement
of the number of observed inverted repeats
in different species of bacteria. The results presented in
Figure 1 are just illustrative of the varied behavior observed
in 5 different cases. The complete results obtained
by our investigation are summarized in Table I, where we
group the 37 investigated genomes.
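
The two properties of the Bernoullian prediction used above (no dependence on m and an exponential decrease with ℓ) can be made concrete with a small numerical sketch. It assumes that Eq. (1) has the form N (1 - q)^2 q^ℓ with q = 2 Pa Pt + 2 Pc Pg; the function name and the example numbers are hypothetical.

```python
def expected_ir_count(n_nucleotides, p_a, p_c, p_g, p_t, stem):
    """Expected number of maximal IRs of a given stem length, assuming
    Eq. (1) reads N * (1 - q)**2 * q**stem with q = 2*Pa*Pt + 2*Pc*Pg.

    The loop length m does not appear: the prediction is independent of it.
    """
    q = 2 * p_a * p_t + 2 * p_c * p_g  # probability that two bases pair
    return n_nucleotides * (1 - q) ** 2 * q ** stem


# Uniform composition: q = 1/4, so each extra stem nucleotide lowers the
# expected count by a factor of four (equally spaced log contour lines).
n6 = expected_ir_count(1_000_000, 0.25, 0.25, 0.25, 0.25, stem=6)
n7 = expected_ir_count(1_000_000, 0.25, 0.25, 0.25, 0.25, stem=7)
print(n6, n7 / n6)
```

On a logarithmic scale the constant ratio between successive stem lengths corresponds to the approximately constant spacing of the contour lines noted for the computer generated data.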

RESULTS AND DISCUSSION 

In Table I we report the observed number of inverted
repeats as a function of the stem length ℓ for each investigated
genome. The genomes are listed in groups
separated by a horizontal line. From top to bottom,
the groups are archaebacteria, chlamydiales, firmicutes,
proteobacteria and others. This last group includes
aquificales, spirochetales, deinococcales, cyanobacteria
and thermotogales. To limit the size of the Table we
sum the occurrences observed for the different values of
3 ≤ m ≤ 10. To make comparison possible between
genomes of different sizes, the values are normalized to
one million base pairs. In the Table we also provide (in
parentheses) the value expected for computer generated
genomes characterized by a local first-order Markovian
model. The χ² values are calculated by comparing the
empirical occurrence of inverted repeats with the occurrence
predicted by a first-order Markovian process, i.e. the one
observed in computer generated data, when ℓ varies
from 4 to 20 (the numbers
of IRs with ℓ = 4 and ℓ = 5 are not shown in Table
I for lack of space; they are available at the web site
http://lagash.dft.unipa.it/IR.htm). The calculation
is done by comparing the two distributions using six bins.
In fact, the numbers of IRs with ℓ > 8 are summed
together in the presentation of Table I and in the calculation
to compensate for the exponential decrease observed
in the number of IRs when ℓ increases. The obtained χ²
values are in all except one case larger than 30, implying
that the p value is always below (and often much below)
1 X 10~^. The only exception is the aquificales Aquifex 
aeolicus, which presents a p value of 0.52. To make a direct
comparison between different genomes, in Table I we use
a color code for the contribution to the χ² of each
number of IRs of stem length ℓ. Specifically, for
each value of ℓ we compute (n_obs - n_Markov)/√n_Markov.
The square of this quantity directly contributes to the
χ² value. In the Table, for values of this parameter lower
than -3 we use a blue character. A black character is
used when this parameter is between -3 and 3, for values
larger than 3 we use a red character and for values larger
than 10 the number of IRs is purple.
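
The normalized deviation and the color code just described can be written down in a few lines. The sketch below is illustrative only: the function names are hypothetical and the counts in the example are invented numbers, not values taken from Table I.

```python
import math


def deviation(n_obs, n_markov):
    """Normalized deviation (n_obs - n_Markov) / sqrt(n_Markov)."""
    return (n_obs - n_markov) / math.sqrt(n_markov)


def color_code(z):
    """Color used for a given deviation, following the thresholds in the text."""
    if z < -3:
        return "blue"
    if z > 10:
        return "purple"
    if z > 3:
        return "red"
    return "black"


def chi_square(observed, expected):
    """Sum of squared deviations over the bins of stem length."""
    return sum(deviation(o, e) ** 2 for o, e in zip(observed, expected))


# Invented counts for two bins, used only to exercise the functions.
for n_obs, n_markov in [(121, 100), (150, 100), (300, 100), (50, 100)]:
    z = deviation(n_obs, n_markov)
    print(n_obs, n_markov, round(z, 2), color_code(z))
```

A deviation larger than 3 in absolute value contributes more than 9 to the χ², and a deviation larger than 10 contributes more than 100.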

Our results show that the number of inverted repeats
is much higher than the one theoretically expected for
a first-order Markovian DNA of the same composition
in chlamydiales, firmicutes and proteobacteria. Deviations
in the observed number of IRs that contribute more
than 100 to the χ² are detected in several genomes, especially
when ℓ > 8. Moreover, we observe deviations
of the number of IRs that contribute more than 9 to
the χ² in the large majority of genomes for almost all
the ℓ values. Among eubacteria only Aquifex aeolicus
shows no enrichment, whereas Deinococcus radiodurans
and Synechocystis sp. show fewer IRs than predicted by a
first-order Markovian DNA. In contrast, a general pattern
does not emerge in archaebacteria. In three of
the eight complete genomes investigated the number of
empirically observed IRs is comparable with the one expected
by using a Markovian model, whereas in the cases
of Aeropyrum pernix, Halobacterium sp., Methanococcus
jannaschii, Methanobacterium thermoautotrophicum and
Thermoplasma acidophilum IRs are slightly more frequent
or more frequent than expected. In spite of this
variety of behavior, a general remark is that the number
of IRs in archaebacteria is, in most cases, much lower
than the one observed in eubacteria.

FIG. 2. Ratio μ between the percentage of IRs found in
non-coding regions and the percentage of non-coding regions
present in the entire genome. Each vertical line refers to a
genome. The 37 investigated genomes are grouped as archaebacteria
(I), chlamydiae (II), firmicutes (III), proteobacteria
(IV) and others (V). Within each group, the order of the
different genomes is the same as the one given in Table I.
Different colors of the symbol refer to different values of ℓ.
Specifically, we use green for ℓ = 6, blue for ℓ = 7, orange for
ℓ = 8, and red for ℓ > 8.

The observed excess of the number of IRs in most of
the complete genomes of eubacteria cannot be due to a
statistical fluctuation. Hence, it is important to relate
this statistical observation to known biological functions.
One explanation of the abundance of such inverted repeats
in eubacteria is that they are used to code
rho-independent hairpins in RNA during transcription
[23]. Motivated by this fact, we investigate the
location of each inverted repeat we find with respect to
the biological information present in the annotation of
each genome. The first investigation concerns the percentage
of inverted repeats located in the non-coding
DNA regions for each genome and for different values
of the stem length ℓ ≥ 6. The results are summarized in
Figure 2, where we show for each genome the ratio μ between
the percentage of IRs found in non-coding regions
and the percentage of non-coding regions present in the
entire genome. Each vertical line refers to a genome. The
grouping of genomes and their sequence is the same as
the one given in Table I. Different colors refer to different
values of ℓ. Specifically, we use green squares for ℓ = 6,
blue squares for ℓ = 7, orange squares for ℓ = 8, and
red squares for ℓ > 8. A null random hypothesis would
suggest that μ ≈ 1 for all values of ℓ. Several genomes
of groups II, III, IV and V show values of μ which are
much larger than one. The value of μ always increases
when ℓ increases, assuming its largest values for large values
of ℓ. For most genomes, this implies that a large number
of inverted repeats are almost exclusively located in
the non-coding regions when the stem length is longer
than 8. However, this behavior is not so pronounced in
all investigated genomes. It is worth noting that the inverted
repeats of most archaebacteria are characterized
by moderately large values of μ (see group I in Figure
2). Moreover, this different behavior is also observed for
some eubacteria such as, for example, Mycobacterium tuberculosis
and Treponema pallidum.

A more quantitative test is obtained by performing
a χ² test of the null hypothesis that the number of
IRs found in non-coding regions is proportional to the
percentage of non-coding regions present in the entire
genome. The p-values associated with the above test
for ℓ = 6 are below the 1% level in all but three
genomes, specifically Aeropyrum pernix, Pseudomonas
aeruginosa and Treponema pallidum. For IRs with stem
length ℓ > 8 only Halobacterium sp. and Treponema pallidum
have a p-value larger than 1%. This test shows that
for all the values of ℓ considered the IRs are preferentially
located in non-coding DNA.
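
The ratio μ and a test of this null hypothesis can be sketched as follows. The example assumes a χ² test with two bins (coding and non-coding) and hence one degree of freedom, for which the p-value has the closed form erfc(√(χ²/2)); the function names and the counts are hypothetical and are not taken from the genomes studied here.

```python
import math


def enrichment_ratio(ir_noncoding, ir_total, noncoding_bp, genome_bp):
    """Ratio mu: fraction of IRs in non-coding DNA over non-coding fraction."""
    return (ir_noncoding / ir_total) / (noncoding_bp / genome_bp)


def noncoding_chi2_test(ir_noncoding, ir_total, noncoding_fraction):
    """Chi-square test (1 degree of freedom) of the proportionality null.

    Returns (chi2, p_value). Assumes 0 < noncoding_fraction < 1.
    """
    expected_nc = ir_total * noncoding_fraction
    expected_c = ir_total - expected_nc
    observed_c = ir_total - ir_noncoding
    chi2 = ((ir_noncoding - expected_nc) ** 2 / expected_nc
            + (observed_c - expected_c) ** 2 / expected_c)
    # Survival function of the chi-square distribution with 1 d.o.f.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Invented example: 30 of 100 IRs fall in the 10% of the genome that is non-coding.
print(enrichment_ratio(30, 100, 100_000, 1_000_000))
print(noncoding_chi2_test(30, 100, 0.1))
```

In this invented case μ is about 3 and the p-value is far below 1%, which is the kind of outcome reported above for most eubacterial genomes.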

Non-coding regions of complete genomes can be classified
with respect to the orientation of the bounding coding
regions. Coding regions can be "divergent" (two 5'
ends bounding the non-coding region), "unidirectional"
(examples are a spacer between two genes in one operon
or a spacer between two consecutive unidirectional operons)
or "convergent" (two 3' ends bounding the non-coding
region). The analysis of the statistical location
of IRs in these biologically different non-coding regions
provides relevant information about the potential biological
role of some of the detected IRs. To perform a
statistical analysis of the location of IRs in these three
types of spacer we split the set of the non-coding regions
into three subsets: (i) non-coding regions between
two 5' ends of two "divergent" genes (we refer to these
non-coding regions as type A regions); (ii) non-coding
regions between a 3' end and a 5' end of two different
genes (type B regions) and (iii) non-coding regions between
two 3' ends of two "convergent" genes (type C regions).
The genes bounding non-coding regions of type A
and C certainly belong to two different operons, whereas
the genes bounding non-coding regions of type B may
or may not belong to the same operon. The statistical
properties of the length of the non-coding regions
are different for the three groups. The non-coding regions
belonging to type A are on average longer than the
non-coding regions belonging to the other two groups.
A statistical characterization of all genomes shows that
in (almost) all the considered genomes the probability
density function of the length of non-coding regions
of type A shows a broad maximum at a length of 150-200
nucleotides and decays to zero for small and large
values of non-coding region length. The probability density
function of the length of non-coding regions belonging
to type B is approximately an exponentially decaying function.
The non-coding regions belonging to type C
behave differently in genomes of different organisms. In
many purple bacteria the probability density function of
the length of type C non-coding regions has a sharp peak located
at a length of approximately 40-60 nucleotides, whereas
in other genomes an approximately monotonic decaying
behavior is observed.
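
The classification into type A, B and C regions follows directly from the strands of the two bounding genes. A minimal sketch is given below; it assumes that genes are supplied as (start, end, strand) tuples with 1-based inclusive coordinates, ignores the wrap-around of a circular chromosome, and uses a hypothetical function name.

```python
def classify_intergenic(genes):
    """Return non-coding regions as (start, end, type) with type A, B or C.

    Assumptions: genes are (start, end, strand) with 1-based inclusive
    coordinates and strand '+' or '-'; the circular wrap-around is ignored.
    """
    genes = sorted(genes)
    regions = []
    prev_end, prev_strand = genes[0][1], genes[0][2]
    for start, end, strand in genes[1:]:
        if start > prev_end + 1:  # there is a gap between the two genes
            if prev_strand == strand:
                kind = "B"        # a 3' end faces a 5' end (unidirectional)
            elif prev_strand == "-":
                kind = "A"        # two 5' ends face each other (divergent)
            else:
                kind = "C"        # two 3' ends face each other (convergent)
            regions.append((prev_end + 1, start - 1, kind))
        if end > prev_end:        # keep the gene reaching furthest right
            prev_end, prev_strand = end, strand
    return regions


toy_genes = [(1, 100, "+"), (151, 300, "-"), (401, 500, "+"), (601, 700, "+")]
print(classify_intergenic(toy_genes))
# [(101, 150, 'C'), (301, 400, 'A'), (501, 600, 'B')]
```

Overlapping or directly adjacent genes produce no non-coding region in this sketch, which is one possible convention among several.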

We investigate the distribution of IRs in non-coding regions
belonging to the different groups A, B and C. We make
the null hypothesis that the probability of having an IR in
any non-coding part of the genome is just proportional
to the length of that non-coding region. For example,
the probability of finding an IR in a non-coding region
belonging to type A is given by the total length of non-coding
regions of this type divided by the total length of
the non-coding regions of the genome. For each genome
and for each value of ℓ (ℓ = 6, ℓ = 7, ℓ = 8 and ℓ > 8)
we compute the p-value associated with the χ² test of the
above null hypothesis. Table II shows these p-values for
the considered genomes. We note that in archaebacteria
the IRs are distributed in the three groups of non-coding
regions in good agreement with the null hypothesis. The
only exception is the Halobacterium genome for ℓ = 6
and ℓ > 8. On the other hand, for many eubacteria the
p-values are very small, especially for longer IRs. In these
genomes, a direct inspection of the number of inverted repeats
located in each group of non-coding regions shows
that, when ℓ > 8, the IRs tend to be preferentially located in
non-coding regions of type C for the genomes
that reject the null hypothesis. Low p-values are also observed
for low values of ℓ (we compute p-values also for
ℓ = 4 and 5; they are not shown in Table II for lack of
space but are available on-line). The reason for these low
p-values for low values of ℓ is different from the one for the
large values of ℓ for most genomes. For example, most of
the genomes (31 out of 37) show that the number of IRs
located in the type A non-coding regions exceeds the
number expected under the null hypothesis by six percent
on average for ℓ = 4. This result suggests a potential
biological role of short IRs located in type A non-coding
regions which is different from the biological role of long
IRs located in type C non-coding regions.
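
With three types of region the χ² test of the length-proportional null hypothesis has two degrees of freedom, and its p-value reduces to exp(-χ²/2). The sketch below illustrates the computation; the function name and all counts and lengths are invented for the example.

```python
import math


def region_type_test(ir_counts, region_lengths):
    """Chi-square test that IR counts in types A, B, C follow total length.

    ir_counts and region_lengths are dicts keyed by 'A', 'B', 'C'.
    Returns (chi2, p_value) for 2 degrees of freedom.
    """
    total_ir = sum(ir_counts[k] for k in "ABC")
    total_len = sum(region_lengths[k] for k in "ABC")
    chi2 = 0.0
    for kind in "ABC":
        expected = total_ir * region_lengths[kind] / total_len
        chi2 += (ir_counts[kind] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 2 d.o.f.
    return chi2, math.exp(-chi2 / 2)


lengths = {"A": 200_000, "B": 500_000, "C": 300_000}  # invented lengths (nt)
print(region_type_test({"A": 20, "B": 50, "C": 30}, lengths))  # agrees with the null
print(region_type_test({"A": 10, "B": 40, "C": 50}, lengths))  # excess in type C
```

In the second invented case the excess of IRs in type C regions gives a p-value well below 1%, mirroring the behavior described for long IRs in many eubacteria.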

Next we investigate the statistical properties of the dis- 
tance of the IRs located in non-coding regions from the 
nearest coding region. This investigation is performed 
by dividing IRs in 4 groups. The first two groups con- 
tain IRs located in type A and type C non-coding regions 
previously defined. IRs located in type A non-coding re- 
gions are denoted as A5' whereas IRs located in type C 
non-coding regions are denoted C3'. IRs found in type
B regions are divided into two subgroups depending on
whether the considered IR is closer to a 5' end
(we address this subset as B5') or to a 3' end (we address 
this subset as B3'). 
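
The assignment of an IR to one of the four subsets, together with its distance from the nearest gene boundary, can be sketched as follows. The distance convention (number of nucleotides between the IR and the flanking gene), the tie-breaking rule and the function name are assumptions made for this illustration.

```python
def assign_subset(ir_start, ir_end, region_start, region_end, region_type, strand=None):
    """Return (subset label, distance in nt to the nearest gene boundary).

    Coordinates are 1-based and inclusive. For type B regions `strand` is
    the common strand of the two flanking genes ('+' or '-').
    """
    d_left = ir_start - region_start   # nucleotides between left gene and IR
    d_right = region_end - ir_end      # nucleotides between IR and right gene
    distance = min(d_left, d_right)
    if region_type == "A":
        return "A5'", distance
    if region_type == "C":
        return "C3'", distance
    if strand not in ("+", "-"):
        raise ValueError("type B regions need the strand of the flanking genes")
    # On the '+' strand the left gene ends with its 3' end and the right gene
    # starts with its 5' end; on the '-' strand the roles are exchanged.
    left_end, right_end = ("3'", "5'") if strand == "+" else ("5'", "3'")
    nearest = left_end if d_left <= d_right else right_end
    return "B" + nearest, distance


print(assign_subset(110, 130, 101, 200, "B", "+"))  # ("B3'", 9)
print(assign_subset(110, 130, 101, 200, "B", "-"))  # ("B5'", 9)
print(assign_subset(170, 195, 101, 200, "C"))       # ("C3'", 5)
```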

For each group and for each value of stem length ℓ, we
estimate the mean distance of the IRs from the closest coding
region. This is done by analysing the four groups of IRs
separately. To obtain mean values which are statistically
reliable for IRs characterized by both small and large
values of ℓ, we perform our analysis on the six genomes
having the largest number of inverted repeats. These
genomes are Bacillus halodurans, Bacillus subtilis, Neisseria
meningitidis serogroup B, Escherichia coli, Pseudomonas
aeruginosa and Vibrio cholerae Chr I. In Table
III we summarize the average distance of the IRs from
the closest gene boundary for the four groups described
above for ℓ = 6 and ℓ > 8. We select these values of ℓ to
limit the size of the Table and because they are representative
of the behavior observed for small and large values
of ℓ. From the Table we note that the average distance
of the IRs from the nearest coding region decreases when
two conditions are simultaneously fulfilled. The first is
that the considered IR has a stem length longer than
8 nucleotides (ℓ > 8) and the second is that the IR is
near the 3' end of a coding region. In fact, for the B3'
subset we observe that the mean distance from the 3' end
of the coding region decreases when ℓ increases from 6 to
a value larger than 8 for all the considered genomes. The
same behavior is observed in the C3' subset. Indeed, in
this last case the decrease of the mean distance is even
more pronounced than in the previous one. On the contrary,
when the IR is closer to a 5' end of the nearest
coding region the mean distance remains approximately
the same when ℓ increases from 6 to values larger than
8, both for the case A5' and for the case B5'. Only one
exception to this general trend is detected. It is the case
of IRs of V. cholerae Chr I in the B5' subset.

FIG. 3. Probability density functions P(d) to find an IR
near the 5' end of a coding region in a type A non-coding
region (left panels) or an IR near the 3' end of a coding region
in a type C non-coding region (right panels) when ℓ = 6,
ℓ = 7, ℓ = 8, and ℓ > 8. Data refer to the complete genome
of E. coli.

In Figure 3 we illustrate the decrease of the mean distance
occurring for most eubacterial genomes rich in long
inverted repeats by considering the case of E. coli. In the
figure we show the empirical probability density functions
P(d) of (i) the distance d_5' from a 5' end of an IR located
in a type A non-coding region (left panels) and (ii) the
distance d_3' from a 3' end of an IR located in a type C
non-coding region (right panels) for ℓ = 6, ℓ = 7, ℓ = 8,
and ℓ > 8. The left panels always show broad probability
density functions without a characteristic distance.
In the right panels, the probability density function is
also rather broad for ℓ = 6. However, when ℓ increases
above six P(d_3') progressively displays a clear peak localized
around d_3' ≈ 20 nt. The preferential localization of
the IRs characterized by a long stem near the 3' end
of a coding region supports the biologically motivated
hypothesis that these structures may play the role of intrinsic
terminators of the transcription process. However,
the parallel analysis of 37 complete genomes summarized
in Tables I and II shows that this is not a general feature
of all bacteria but rather depends on the specific species.

FIG. 4. Probability density function W(n) of the length n
of the non-coding regions in which an inverted repeat of
stem length ℓ = 4 (green line), ℓ = 6 (blue line) and ℓ > 8
(red line) is found. Data refer to the genome of E. coli. The
black dashed line is obtained from the null random hypothesis
discussed in the text. The four panels show the data for the
four subsets of IRs defined in the text. Specifically, A5' IRs
are in (a), B5' in (b), B3' in (c) and C3' in (d).

We have shown that for several eubacteria the average
distance from the nearest coding region of IRs characterized
by a large value of the stem length (usually ℓ > 8)
decreases when the IR is located near a 3' end of a coding
region. This behavior may have two different
explanations. First, the longest IRs locate preferentially
in short non-coding regions. Second, for each non-coding
region of a given length the longest IRs belonging to subsets
B3' and C3' tend to be located closer to the 3' end
of the bounding gene than expected under a random assumption.
The test of the second hypothesis is statistically
difficult because the number of IRs is not sufficient
to give a robust estimation of the mean distance of the
IRs from the 3' end of the bounding gene conditioned on
the length of the non-coding region in which the IR is
found. For this reason, we test only the first hypothesis
in the present study. In order to test the first hypothesis
we investigate the probability W(n) that an IR, belonging
to one of the four groups previously defined, is found
in a non-coding region of length n. Specifically, with each
IR found in non-coding DNA we associate the length of
the non-coding region in which the IR is located. We determine
the histogram of the length of the non-coding regions
corresponding to each IR for the stem lengths ℓ = 4,
ℓ = 6 and ℓ > 8 for the four subsets A5', B5', B3' and C3'.
This is shown in Figure 4 for the genome of E. coli. The
values of ℓ used have been selected because they are representative
of the behavior observed for small, medium
and large values of the stem length. As a null hypothesis
we assume that the probability that an IR, which
belongs to one of the four subsets, is found in a non-coding
region of length n is equal to the length of the
non-coding region divided by the total length n_tot of the
non-coding regions of the same type (A, B or C).
Figure 4 shows the probability density function expected
according to this null hypothesis as a black dashed line.
This reference probability density function



$$W(n) = A \, \frac{n}{n_{tot}} \, P(n) \qquad (2)$$



is obtained by multiplying the empirical probability density
function P(n) of the length of non-coding regions
measured in the corresponding group of regions by
n/n_tot, i.e. the length of the non-coding region divided
by the total length of non-coding regions belonging to
the considered group. The parameter A is a normalizing
constant. Panels (a) and (b) show that the probability
density function W(n) of IRs of subsets A5' and B5' is
described quite well by the null hypothesis for all values
of ℓ. Panel (c) shows that the probability density function
W(n) of IRs of subset B3' is in good agreement with
the null hypothesis for ℓ = 4 and ℓ = 6 whereas a small
discrepancy is observed for ℓ > 8. This discrepancy becomes
extremely evident in panel (d), which shows
W(n) for IRs belonging to subset C3'. From the analysis
of Figure 4 we conclude that long IRs located in a type C
non-coding region are found in short non-coding regions
with a much higher probability than expected from a null
random hypothesis.
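
The reference distribution of Eq. (2) can be computed directly from the list of region lengths of one type. The following sketch treats W(n) as a discrete distribution over the observed lengths rather than a binned density, which is a simplification made here; the function name and the toy lengths are hypothetical.

```python
from collections import Counter


def null_length_distribution(region_lengths):
    """Discrete version of Eq. (2): W(n) = A * (n / n_tot) * P(n).

    region_lengths lists the lengths of all non-coding regions of one type.
    Returns a dict mapping each observed length n to W(n).
    """
    n_tot = sum(region_lengths)
    n_regions = len(region_lengths)
    # Empirical distribution P(n) of the region length.
    p = {n: c / n_regions for n, c in Counter(region_lengths).items()}
    unnormalized = {n: p[n] * n / n_tot for n in p}
    a = 1 / sum(unnormalized.values())  # normalizing constant A
    return {n: a * w for n, w in sorted(unnormalized.items())}


# Toy example: two regions of 10 nt and one of 30 nt.
print(null_length_distribution([10, 10, 30]))
```

In the toy example the single 30 nt region receives probability 0.6 and the two 10 nt regions together receive 0.4, i.e. each region is weighted by the share of non-coding DNA it contains. In this discrete form the constant A equals the number of regions.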

The final investigation concerns a comparison between
the IRs found by us in the non-coding regions of several
eubacteria and the intrinsic terminators predicted with
bioinformatics methods in Ref. [23]. In that study the
biological information summarized in a review paper
is used to train a computer program able
to detect potential rho-independent intrinsic terminators
in DNA sequences. The optimization of the parameters
of the program was performed to ensure that 89%
of the rho-independent intrinsic terminators known from
biological studies are identified by using the 98% confidence
threshold. We have used the computer program
TransTerm of Ref. [23] to detect how many of the IRs
with ℓ > 8 we find in the non-coding regions of several eubacterial
genomes are predicted as rho-independent terminators
by TransTerm.

FIG. 5. Probability density function P(d_3') of a subset of
IRs with stem length ℓ > 8 located in type B and C non-coding
regions. The subset is defined by considering the IRs
which are near the 3' end of a coding region and which
are not predicted as rho-independent intrinsic terminators by
the TransTerm program at a 98% confidence threshold in the
six genomes of (a) Bacillus halodurans (129 IRs), (b) Bacillus
subtilis (459 IRs), (c) Neisseria meningitidis serogroup B
(169 IRs), (d) Escherichia coli (241 IRs), (e) Pseudomonas
aeruginosa (279 IRs) and (f) Vibrio cholerae (172 IRs).

By selecting the confidence level of 98% recommended
by the authors, we note that only a part of the IRs localized
near the end of the nearest coding regions are
predicted as rho-independent terminators by TransTerm.
Hundreds of IRs are not predicted as potential rho-independent
terminators by TransTerm. We verify that
these IRs maintain the characteristic of being localized
at a typical distance from the end of the nearest coding
regions in several genomes very rich in IRs. Figure
5 shows the inverted repeats with ℓ > 8 which are
not predicted as rho-independent intrinsic terminators
by TransTerm in the six genomes of Bacillus halodurans,
Bacillus subtilis, Neisseria meningitidis serogroup
B, Escherichia coli, Pseudomonas aeruginosa and Vibrio
cholerae. In all these cases the IRs are still localized near
the end of the nearest coding regions.



CONCLUSION

In summary, a comparative statistical investigation of the number and
location of IRs in several complete genomes shows that a large number
of them may act as intrinsic terminators of transcription in several
eubacteria. For almost all the bacteria investigated, we show
quantitatively that IRs locate preferentially in non-coding regions.
Moreover, in several eubacteria, IRs characterized by large values of
the stem length ℓ locate preferentially in the non-coding regions
bounded by two 3' ends of convergent genes. We also show statistical
evidence that long IRs are found in short non-coding regions more
frequently than expected under a null random hypothesis. A large
number of the IRs detected with our statistical methodology near the
3' end of a coding region differ from the rho-independent
transcription terminators detected by a specialized computer program
trained exclusively on molecular biological information obtained by
investigating the transcription process in Escherichia coli. Our
findings, based on the comparative analysis of non-coding regions of
complete genomes, suggest that other forms of intrinsic termination
may be active in several eubacteria.
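
The region types referred to above can be made concrete with a small
Python sketch that labels each intergenic gap by the gene ends that
bound it. Only type C (two 3' ends of convergent genes) is defined in
this passage; reading type A as "two 5' ends" and type B as "one 5' and
one 3' end" is an assumption inferred from the group labels A5', B5',
B3' and C3' of Table III. The gene tuples `(start, end, strand)` with
half-open coordinates are a hypothetical input format, and the wrap
around a circular chromosome is ignored.

```python
from typing import List, Tuple

Gene = Tuple[int, int, str]  # (start, end, strand), strand is "+" or "-"


def classify_gap(left_strand: str, right_strand: str) -> str:
    """Label the gap between two consecutive genes as type A, B or C."""
    # The right-hand end of the left gene is its 3' end on the + strand.
    left_end = "3'" if left_strand == "+" else "5'"
    # The left-hand end of the right gene is its 5' end on the + strand.
    right_end = "5'" if right_strand == "+" else "3'"
    if left_end == "5'" and right_end == "5'":
        return "A"  # divergent genes (assumed definition)
    if left_end == "3'" and right_end == "3'":
        return "C"  # convergent genes
    return "B"      # genes on the same strand (assumed definition)


def noncoding_regions(genes: List[Gene]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) for each gap between consecutive genes."""
    ordered = sorted(genes)
    regions = []
    for (s1, e1, st1), (s2, e2, st2) in zip(ordered, ordered[1:]):
        if s2 > e1:  # overlapping or abutting genes leave no gap
            regions.append((e1, s2, classify_gap(st1, st2)))
    return regions
```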

Inverted repeats are not involved only in transcription termination in
bacteria. Indeed, our comparative genomic study shows that short IRs
are slightly more abundant than expected under a null hypothesis in
type A non-coding regions of several genomes. This may be due to
fluctuations in nucleotide concentration correlated with the type of
non-coding region considered, or it may indicate a potential
biological role for some of these short IRs, consistent with known
results in the literature. The role of IRs with short values of ℓ
located in type A non-coding regions will be considered in a future
study.
TABLE I. Inverted repeats per million base pairs of stem length ℓ and
loop length m within the interval 3 < m < 10 detected in 37 complete
genomes.

Genome                  ℓ = 6               ℓ = 7               ℓ = 8             ℓ > 8
A. pernix               1519.44 (1388.88)    444.39  (344.97)   116.19 (110.20)    45.52  (26.95)
A. fulgidus             1387.72 (1325.74)    302.52  (331.89)    94.11  (81.71)    25.71  (34.89)
Halobacterium sp.       2893.40 (2450.06)    938.32  (728.31)   328.66 (215.96)   171.78  (86.88)
M. jannaschii           2258.90 (2378.42)    706.32  (689.50)   219.82 (211.42)   137.54  (81.08)
M. thermoautotrophicum  1573.62 (1272.71)    435.66  (341.45)   108.49  (80.51)    34.83  (23.98)
P. abyssi               1291.13 (1328.52)    339.35  (328.59)    92.91  (83.28)    39.66  (31.16)
P. horikoshii           1396.03 (1414.43)    371.58  (389.99)   101.81 (104.69)    48.32  (40.84)
T. acidophilum          1455.68 (1229.47)    398.11  (318.87)   114.38  (78.60)    46.65  (34.51)
C. pneumoniae AR39      1916.49 (1428.63)    541.53  (366.71)   187.83  (94.32)   135.79  (31.71)
C. pneumoniae CWL029    1915.90 (1402.99)    541.36  (403.99)   188.58 (101.61)   132.50  (24.39)
C. pneumoniae J138      1921.41 (1465.48)    538.16  (381.84)   188.88  (86.30)   135.15  (33.38)
C. trachomatis          1997.36 (1439.11)    575.08  (401.16)   205.72  (98.18)   165.51  (29.92)
C. trachomatis ser D    1956.80 (1391.82)    638.84  (335.73)   195.68  (95.92)   175.54  (34.53)
B. halodurans           1415.16 (1289.04)    376.69  (340.29)   103.04  (90.90)   143.02  (32.36)
B. subtilis             1492.83 (1408.13)    404.76  (355.41)   121.95 (101.07)   216.14  (28.47)
M. genitalium           3192.70 (2628.98)   1030.90  (656.81)   330.99 (237.90)   160.32  (89.64)
M. pneumoniae           2163.17 (1618.09)    608.77  (436.06)   224.16 (104.12)   107.79  (53.90)
U. urealyticum          4244.94 (3824.57)   1416.75 (1219.87)   525.46 (401.75)   323.26 (183.58)
M. tuberculosis         2179.97 (1947.85)    638.55  (560.80)   202.20 (158.45)   100.87  (56.67)
R. prowazekii           3131.74 (2724.19)    957.25  (853.78)   287.89 (229.41)   166.44 (103.46)
N. meningitidis ser A   1649.88 (1463.56)    432.61  (397.36)   417.96  (96.14)   231.64  (39.37)
N. meningitidis ser B   1648.07 (1419.68)    462.52  (375.38)   415.43  (94.62)   239.40  (34.77)
C. jejuni               3334.18 (2939.42)   1019.81  (883.35)   339.33 (252.82)   226.62 (122.45)
H. pylori 26695         2159.04 (1960.59)    585.78  (549.80)   204.45 (175.07)    90.53  (65.35)
H. pylori J99           2101.19 (1885.84)    605.29  (537.16)   184.33 (147.22)    83.34  (55.97)
Buchnera sp.            4807.38 (3469.75)   1793.40 (1052.01)   616.53 (334.02)   358.99 (163.89)
E. coli                 1441.62 (1216.58)    419.90  (322.04)   147.22  (76.52)   154.98  (27.16)
H. influenzae           2015.69 (1748.50)    596.13  (469.36)   181.95 (139.88)   204.36  (62.84)
P. aeruginosa           2422.58 (2072.34)    740.37  (563.02)   251.74 (167.29)   182.30  (58.90)
V. cholerae Chr I       1508.87 (1266.40)    414.37  (323.19)   136.77  (85.10)   215.79  (29.04)
X. fastidiosa           1391.03 (1155.52)    397.12  (301.20)   103.01  (74.65)    76.51  (27.99)
A. aeolicus             1489.68 (1518.69)    415.77  (384.83)   107.00 (109.58)    35.45  (35.45)
B. burgdorferi          3235.89 (2847.19)   1068.38  (908.07)   332.70 (311.84)   262.43 (124.08)
T. pallidum             1539.53 (1213.52)    424.42  (304.04)   100.17  (76.45)    69.42  (21.09)
D. radiodurans          1545.32 (2050.11)    345.46  (584.83)    81.93 (161.59)    62.30  (65.32)
Synechocystis sp.       1096.13 (1379.61)    223.59  (378.34)    47.01  (94.59)    12.03  (36.94)
T. maritima             1609.05 (1399.99)    447.14  (382.65)   140.81 (107.48)   116.62  (25.26)

The number in parentheses is the theoretical prediction obtained from
computer data generated by using a local first-order Markov process.
We use the color code described in the text to quantify the
discrepancy between the empirical results and the predictions of the
Markov model.
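
The kind of comparison reported in Table I can be sketched in Python
as follows: count the inverted repeats of a given stem length in a
sequence, generate a surrogate sequence from a first-order Markov
chain fitted to it, and express both counts per million base pairs.
This is a simplified illustration under stated assumptions, not the
procedure used for the table: the chain here is fitted globally rather
than locally, every exact match of the given stem length is counted
(no check that the stem cannot be extended), and the loop bounds 3 and
10 are treated as inclusive. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(s: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return s.translate(_COMPLEMENT)[::-1]


def count_irs(seq: str, stem: int, min_loop: int = 3, max_loop: int = 10) -> int:
    """Count (position, loop) pairs whose two arms of length `stem`
    are exact reverse complements of each other."""
    n = len(seq)
    count = 0
    for i in range(n - 2 * stem - min_loop + 1):
        left_rc = revcomp(seq[i:i + stem])
        for m in range(min_loop, max_loop + 1):
            j = i + stem + m          # start of the right arm
            if j + stem > n:
                break
            if seq[j:j + stem] == left_rc:
                count += 1
    return count


def fit_first_order(seq: str) -> dict:
    """Count nucleotide-to-nucleotide transitions over the whole sequence."""
    trans = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        trans[a][b] += 1
    return dict(trans)


def simulate(trans: dict, length: int, first: str, rng: random.Random) -> str:
    """Draw a surrogate sequence (length >= 1) from the fitted chain."""
    out = [first]
    while len(out) < length:
        successors = trans.get(out[-1])
        if successors:
            bases, weights = zip(*successors.items())
            out.append(rng.choices(bases, weights=weights)[0])
        else:
            # Symbol never seen with a successor: fall back to a uniform draw.
            out.append(rng.choice("ACGT"))
    return "".join(out)


def per_million(count: int, length: int) -> float:
    """Rescale a raw count to occurrences per million base pairs."""
    return 1e6 * count / length
```

In use, `per_million(count_irs(genome, 6), len(genome))` would be
compared with the same quantity averaged over several surrogate
sequences returned by `simulate`.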
TABLE II. p-values associated with the test of the null hypothesis
discussed in the text on the distribution of inverted repeats in the
different types of non-coding regions for 37 bacterial genomes. Values
are given in percent.

Genome                  ℓ = 6    ℓ = 7    ℓ = 8    ℓ > 8
A. pernix               69.68    49.54    78.37    82.65
A. fulgidus             25.53    44.92     6.75    75.39
Halobacterium            0.47    19.94    61.33     0.33
M. jannaschii           42.14     2.33     3.09    90.29
M. thermoautotrophicum   2.16    28.95     2.66    88.31
P. abyssi               54.75     7.50     3.23    41.93
P. horikoshii            6.59    23.62    70.80    39.27
T. acidophilum           5.80     2.50    27.14    69.38
C. pneumoniae AR39      10.70    37.87     6.16     8.28
C. pneumoniae CWL029     1.38    12.45    11.98     4.35
C. pneumoniae J138       7.01    29.77    42.17     0.00
C. trachomatis          70.56    79.23    89.88     0.00
C. trachomatis ser D     0.85    48.25    95.20     0.00
B. halodurans           10.10    50.04    88.75     0.00
B. subtilis              8.03    49.84     1.72     0.00
M. genitalium           97.27    23.22    22.19    24.69
M. pneumoniae           11.97    60.14    38.18    28.93
U. urealyticum           0.18    21.55     9.51     0.48
M. tuberculosis         86.16    23.36    70.71     3.73
R. prowazekii            2.19    58.18    88.54    87.07
N. meningitidis ser A   13.71    21.99     0.00     0.00
N. meningitidis ser B    0.33     0.08     0.00     0.00
C. jejuni                0.02    49.08    93.59     0.00
H. pylori 26695         17.32     0.04    97.81    11.29
H. pylori J99           28.40    87.26    73.48     0.01
Buchnera sp.             0.03     4.37    42.02    11.16
E. coli                  3.06     0.16     0.00     0.00
H. influenzae            0.86    38.33     0.20     0.00
P. aeruginosa            1.54     0.00     0.00     0.00
V. cholerae Chr I        0.64    19.41     0.00     0.00
X. fastidiosa           78.58    35.77    17.10     2.02
A. aeolicus             69.94    32.28    24.83    66.97
B. burgdorferi           0.01    14.34    25.77     0.00
T. pallidum             21.35    91.15     0.99     6.45
D. radiodurans           0.03     0.00     0.03     0.00
Synechocystis sp.        1.46    40.93     1.59    33.15
T. maritima             20.98    27.83     2.59     0.00
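
The null hypothesis behind Table II is defined in a part of the text
not reproduced here. Purely to illustrate how a p-value of this kind
can be obtained, the Python sketch below assumes a simpler stand-in
null: IRs fall into the three region types A, B and C in proportion to
the total length of each type. It then applies a chi-square
goodness-of-fit test, using the fact that for 2 degrees of freedom the
chi-square survival function is exp(-x/2). The function name and the
dictionary inputs are hypothetical, and the test actually used in the
study may differ.

```python
import math
from typing import Dict


def chi2_pvalue_three_types(observed: Dict[str, int],
                            lengths: Dict[str, float]) -> float:
    """p-value of a chi-square goodness-of-fit test over types A, B, C.

    observed: number of IRs found in each region type.
    lengths:  total length (in bases, all positive) of each region type;
              the assumed null places IRs in proportion to these lengths.
    """
    total_obs = sum(observed[t] for t in "ABC")
    total_len = sum(lengths[t] for t in "ABC")
    chi2 = 0.0
    for t in "ABC":
        expected = total_obs * lengths[t] / total_len
        chi2 += (observed[t] - expected) ** 2 / expected
    # Three categories give 2 degrees of freedom, for which the
    # survival function has the closed form exp(-chi2 / 2).
    return math.exp(-chi2 / 2.0)
```

Multiplying the returned value by 100 gives a percentage comparable in
form to the entries of the table.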



TABLE III. Mean distance of the inverted repeats from the closest gene
boundary for the four groups of inverted repeats defined in the text
for six bacterial genomes.

                             A5'              B5'              B3'              C3'
Genome                  ℓ = 6   ℓ > 8    ℓ = 6   ℓ > 8    ℓ = 6   ℓ > 8    ℓ = 6   ℓ > 8
B. halodurans           97.41   111.2    106.3   106.1    84.49   56.28    67.34   30.59
B. subtilis             86.84   81.90    78.82   83.79    61.96   30.12    79.75   25.86
E. coli                 99.71   89.64    85.96   80.00    74.41   39.31    80.85   32.19
N. meningitidis ser B   100.9   83.19    123.5   100.7    97.46   41.47    154.0   48.92
P. aeruginosa           91.26   117.0    76.01   77.64    66.04   38.87    80.46   35.95
V. cholerae Chr I       93.39   110.7    95.52   40.96    62.61   38.90    84.62   41.86
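
A grouped average of the kind shown in Table III can be sketched in
Python as follows. Each IR is assigned to the closer of the two gene
ends bounding its non-coding region, and distances are averaged per
group label such as B3' or C3'. The region tuple `(start, end, type,
left_end, right_end)`, the half-open coordinates and the edge-to-edge
distance are assumptions made for this example; the groups themselves
are defined in a part of the text not reproduced here.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# (start, end, region type, kind of left gene end, kind of right gene end)
Region = Tuple[int, int, str, str, str]


def closest_boundary(ir_start: int, ir_end: int, region: Region) -> Tuple[int, str]:
    """Distance to the nearer region boundary and the resulting group label."""
    start, end, rtype, left_kind, right_kind = region
    d_left = ir_start - start   # gap between left gene end and IR
    d_right = end - ir_end      # gap between IR and right gene end
    if d_left <= d_right:
        return d_left, rtype + left_kind
    return d_right, rtype + right_kind


def mean_distance_by_group(
    irs: Iterable[Tuple[int, int, Region]]
) -> Dict[str, float]:
    """Average the boundary distance of the IRs within each group."""
    groups = defaultdict(list)
    for ir_start, ir_end, region in irs:
        distance, label = closest_boundary(ir_start, ir_end, region)
        groups[label].append(distance)
    return {label: mean(values) for label, values in groups.items()}
```

Running this separately on the IRs with ℓ = 6 and on those with ℓ > 8
would give the two columns reported for each group.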